@aebrahim

We keep the same 32-digit hexadecimal string format.

Fixes #510.
Relevant to #453 and apache/beam#21298

@claudevdm

Hi @aebrahim, this approach leads to errors in Apache Beam.

For example, in Apache Beam:

  1. cloudpickle is used to serialize dynamic types at pipeline submission time (possibly on a user's workstation).
  2. Distributed workers also use cloudpickle to serialize dynamic types.

During step (1), some user-defined type claims tracking ID 1.
During step (2), the worker imports the cloudpickle library and serializes some dynamic type, but it has no way of knowing that tracking ID 1 was already claimed in a separate Python process. Now two different types are encoded with tracking ID 1.
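
The collision described above can be reproduced in a few lines. This is only a sketch: it assumes the PR replaces the random UUID with a per-process counter formatted as 32 hex digits, and `CounterIdAllocator` is a hypothetical stand-in, not cloudpickle code. Two allocator instances play the role of the two separate Python processes.

```python
import itertools


class CounterIdAllocator:
    """Hypothetical per-process allocator: every process starts counting at 1."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._ids = {}

    def tracking_id(self, cls):
        # Reuse the ID if this "process" has already seen the class.
        if cls not in self._ids:
            self._ids[cls] = format(next(self._counter), "032x")
        return self._ids[cls]


# Two allocators stand in for two independent Python processes.
submission_process = CounterIdAllocator()
worker_process = CounterIdAllocator()

# Two unrelated dynamic types, one created in each "process".
UserType = type("UserType", (), {})
WorkerType = type("WorkerType", (), {})

id_at_submission = submission_process.tracking_id(UserType)
id_on_worker = worker_process.tracking_id(WorkerType)

# Both processes hand out the first counter value, so the IDs collide.
collision = id_at_submission == id_on_worker
```

Neither allocator can see the other's state, so both hand out the first counter value to different types.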

@aebrahim
Author

aebrahim commented Jun 3, 2025

Got it, so the approach taken needs to be some kind of deterministic hash?

@claudevdm

Yes. Perhaps there are ways to strike a balance here, depending on the use case:

  • The user of cloudpickle could pass their own "generate id" function to be used when pickling a dynamic type, while the default stays uuid (the proven way to guarantee isinstance semantics).
  • There could be an option to forgo isinstance semantics completely (don't use tracking IDs at all and always create a new type) when determinism matters more than isinstance semantics.
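
A minimal sketch of what these two options might look like. Everything here is hypothetical: `ClassTracker`, the `generate_id` parameter, and `qualname_hash_id` are illustrative names, not part of cloudpickle's API. Passing `generate_id=None` models the second option, where no tracking ID is assigned at all.

```python
import hashlib
import uuid
import weakref


def random_uuid_id(cls):
    """Default strategy: a fresh random 32-hex-digit ID per class."""
    return uuid.uuid4().hex


def qualname_hash_id(cls):
    """Example user-supplied strategy (hypothetical): hash module + qualname.

    Stable across processes, but two distinct classes that share a
    module and qualname would receive the same ID.
    """
    key = f"{cls.__module__}.{cls.__qualname__}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:32]


class ClassTracker:
    """Hypothetical tracker; not cloudpickle's actual implementation."""

    def __init__(self, generate_id=random_uuid_id):
        # generate_id=None models the second option: no tracking at all.
        self._generate_id = generate_id
        self._ids = weakref.WeakKeyDictionary()

    def get_id(self, cls):
        if self._generate_id is None:
            return None  # the unpickler would always rebuild the class
        if cls not in self._ids:
            self._ids[cls] = self._generate_id(cls)
        return self._ids[cls]
```

With the default, each tracker assigns its own random ID, which preserves isinstance semantics within a session but is not reproducible. With a deterministic function, separate trackers agree on the ID, at the cost of whatever collisions the chosen function allows.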

@ogrisel
Contributor

ogrisel commented Nov 3, 2025

Maybe dynamic classes could be pickled in a two-stage process:

  • whenever a dynamic class identifier is needed, the dynamic class definition (alone) is cloudpickled into a fake file object that incrementally computes a hash of the bytes instead of saving them to disk;
  • then the pickling proceeds as usual, using the computed hash digest as the identifier in place of the current random uuid.

This has some overhead (because of redundant object graph handling and extra hash function calls), so maybe this should only be implemented via a dedicated option not enabled by default to avoid introducing a performance regression for existing cloudpickle users who do not need deterministic pickling.
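
The first stage of the idea above can be sketched with the standard library alone. The assumptions are significant: stdlib `pickle` stands in for cloudpickle, and `class_definition` is a hypothetical, heavily simplified stand-in for "the dynamic class definition (alone)" that covers only the name, base names, and plain attributes. `HashingWriter` and `deterministic_class_id` are illustrative names as well.

```python
import hashlib
import pickle


class HashingWriter:
    """File-like object that hashes written bytes instead of storing them."""

    def __init__(self):
        self._hash = hashlib.sha256()

    def write(self, data):
        # Feed each chunk to the hash incrementally; nothing is kept.
        self._hash.update(data)
        return len(data)

    def hexdigest(self):
        return self._hash.hexdigest()


def class_definition(cls):
    """Simplified stand-in for a class definition: name, bases, attributes."""
    attrs = {k: v for k, v in vars(cls).items() if not k.startswith("__")}
    bases = tuple(base.__qualname__ for base in cls.__bases__)
    return (cls.__qualname__, bases, sorted(attrs.items()))


def deterministic_class_id(cls):
    """Stage one: pickle the definition into the hashing writer."""
    writer = HashingWriter()
    pickle.Pickler(writer, protocol=4).dump(class_definition(cls))
    # Truncate to 32 hex digits to match the existing ID format.
    return writer.hexdigest()[:32]


# Two structurally identical dynamic classes get the same ID;
# a class with a different attribute value gets a different one.
PointA = type("Point", (object,), {"dims": 2})
PointB = type("Point", (object,), {"dims": 2})
PointC = type("Point", (object,), {"dims": 3})

id_a = deterministic_class_id(PointA)
id_b = deterministic_class_id(PointB)
id_c = deterministic_class_id(PointC)
```

The second stage would then pickle the class as usual with this digest as its identifier. The extra pass over the class definition is the overhead mentioned above.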

@ogrisel
Contributor

ogrisel commented Nov 3, 2025

Related effort (to deal with the lack of determinism of co_filename in ipykernel execution environments): #560.


Development

Successfully merging this pull request may close these issues.

Random class_tracker_id for dynamic class

3 participants